This R Markdown file includes 18 scripts used for collapsing and filtering Trinity transcripts: transcripts were analyzed with Transdecoder to derive protein (ORF) predictions, then with Bowtie 2 and eXpress to generate transcriptome-specific abundance estimates. Finally, differential expression between photomorphs was analyzed using edgeR. All these steps are succinctly described in the Materials and Methods of the above paper.
Some steps require a great deal of memory (RAM) and may take a few days to run. A computer cluster with high memory nodes was used for several scripts; these steps are denoted by the symbol “▵”.
The data consist of twelve libraries representing two different conditions, prepared using a poly(A) RNA selection protocol:
The following script performs quality trimming on the Illumina libraries with the program Trimmomatic using a quality cut-off of 15 in a 4-bp sliding window, discarding any reads under 36 bp. Only paired, trimmed reads were used in downstream scripts.
Additionally, in order to counter the sequence content bias attributed to random hexamer priming during cDNA synthesis, we clipped the 5’ ends of all reads in all libraries using FASTX-Toolkit (fastx_trimmer with -f 12, which retains bases from position 12 onward).
# bash
for i in "AS1" "AS2" "AS3" "AS4" "AS5" "AS6" "AS7" "AS8" "AS9" "AS10" "AS11" "AS12";
do
java -jar $Trimmomatic \
PE \
-phred33 \
"$i"_1.fq "$i"_2.fq "$i"_1_trimmed.fq "$i"_1_unpaired.fq "$i"_2_trimmed.fq "$i"_2_unpaired.fq \
ILLUMINACLIP:/home/antoine_simon/prog/Trimmomatic-0.35/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
fastx_trimmer -f 12 -i "$i"_1_trimmed.fq -o trimmomatic_minus12/"$i"_1.fq -Q33
fastx_trimmer -f 12 -i "$i"_2_trimmed.fq -o trimmomatic_minus12/"$i"_2.fq -Q33
done
We assembled all twelve quality-filtered Illumina libraries into a single pooled assembly using Trinity, with a minimum contig length cutoff of 300 bp (--min_contig_length below). Assembly statistics are presented in the table hereafter.
# bash
module load trinity/2.6.6
module load bowtie2/2.3.3.1
module load samtools
module load perl/5.28.1
Trinity --seqType fq --left trimmomatic_minus12/AS1_1.fq,trimmomatic_minus12/AS2_1.fq,trimmomatic_minus12/AS3_1.fq,trimmomatic_minus12/AS4_1.fq,trimmomatic_minus12/AS5_1.fq,trimmomatic_minus12/AS6_1.fq,trimmomatic_minus12/AS7_1.fq,trimmomatic_minus12/AS8_1.fq,trimmomatic_minus12/AS9_1.fq,trimmomatic_minus12/AS10_1.fq,trimmomatic_minus12/AS11_1.fq,trimmomatic_minus12/AS12_1.fq --right trimmomatic_minus12/AS1_2.fq,trimmomatic_minus12/AS2_2.fq,trimmomatic_minus12/AS3_2.fq,trimmomatic_minus12/AS4_2.fq,trimmomatic_minus12/AS5_2.fq,trimmomatic_minus12/AS6_2.fq,trimmomatic_minus12/AS7_2.fq,trimmomatic_minus12/AS8_2.fq,trimmomatic_minus12/AS9_2.fq,trimmomatic_minus12/AS10_2.fq,trimmomatic_minus12/AS11_2.fq,trimmomatic_minus12/AS12_2.fq --CPU 16 --output trinity/AStotal --min_contig_length 300 --JM 250G
Then, TrinityStats.pl, a utility program provided with Trinity, was used to assess the overall quality of the assembly.
# bash
cat TrinityStats.txt
##
## ################################
## ## Counts of transcripts, etc.
## ################################
## Total trinity 'genes': 339573
## Total trinity transcripts: 412618
## Percent GC: 49.20
##
## ########################################
## Stats based on ALL transcript contigs:
## ########################################
##
## Contig N10: 6535
## Contig N20: 4363
## Contig N30: 3031
## Contig N40: 2100
## Contig N50: 1349
##
## Median contig length: 435
## Average contig: 846.31
## Total assembled bases: 349201263
##
##
## #####################################################
## ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
## #####################################################
##
## Contig N10: 4051
## Contig N20: 2247
## Contig N30: 1325
## Contig N40: 859
## Contig N50: 627
##
## Median contig length: 408
## Average contig: 624.86
## Total assembled bases: 212184046
Open reading frames (ORFs) were predicted from the pooled transcriptome assembly using Transdecoder, resulting in 368,720 ORFs.
# bash
module load /isg/shared/modulefiles/TransDecoder/3.0.1
module load /isg/shared/modulefiles/hmmer/3.1b2
mkdir transdecoder
cd transdecoder
#First, run TransDecoder on the pooled transcriptome assembly to extract long ORFs.
TransDecoder.LongOrfs -t ../trinity/AStotal/Trinity.fasta
#Then we identify ORFs with homology to known proteins via a Pfam search; this maximizes the sensitivity for capturing ORFs with functional significance.
hmmscan --cpu 16 --domtblout pfam.domtblout /isg/shared/databases/Pfam/Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep
#Finally, we predict the likely coding regions using the Pfam search output
TransDecoder.Predict -t ../trinity/AStotal/Trinity.fasta --retain_pfam_hits pfam.domtblout --cpu 16
Translated ORFs were then searched against the NCBI non-redundant (nr) protein database using DIAMOND blastp (protein versus protein) in order to generate taxonomic assignments.
# bash
module load /isg/shared/modulefiles/diamond/0.9.9
diamond blastp -d ~/tempdata3/antoine/nt_diamond/nr -q /home/CAM/asimon/transdecoder_continued/Trinity.fasta.transdecoder.pep -o ~/tempdata3/antoine/DIAMOND/diamond_ALL2.txt --taxonmap /home/antoine_simon/database/prot.accession2taxid.gz -e 1e-6 -p 8 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids -k 1
In the following chunk, the outputs of DIAMOND, Transdecoder and Trinity are slightly modified so that they can be used in the subsequent Python scripts: the two multi-line fasta files are converted into single-line fasta files, and the identifiers in the DIAMOND output are shortened.
# bash
cd /Users/antoinesimon/Documents/Dendriscosticta/transcriptomics/NoteBook_AS/
# Trinity.fasta.transdecoder.pep: convert the multiline fasta to a single line fasta, then only keep the identifier
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Trinity.fasta.transdecoder.pep | sed '1d' | awk 'NR%2==1' > Transdecoder_identifier.pep
# Trinity.fasta: convert the multiline fasta to a single line fasta, then only keep the identifier
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Trinity.fasta > Trinity_1.fasta
sed '1d' Trinity_1.fasta | awk 'NR%2==1' > Trinity_identifier.fasta
# Number of rows in Trinity_identifier.fasta; you have to modify the following python script accordingly.
wc -l Trinity_identifier.fasta
# Modify the Diamond file: e.g., shorten "TRINITY_DN10003_c0_g1::TRINITY_DN10003_c0_g1_i2::g.236161::m.236161" into "g.236161"
sed -E 's/^[^:]*::[^:]*::([^:]*)::m\.[0-9]*/\1/g' diamond_ALL2.txt > Diamond_int.txt
The three following Python scripts aim to retrieve, for each contig, the transcript whose longest splicing isoform carries the ORF with the best (lowest) e-value over all ORFs, or the first ORF with a blast e-value of zero (modified versions of scripts from Spribille et al. 2016). If you happen to use these chunks of code in your research, please cite the following paper:
Spribille, T., Tuovinen, V., Resl, P., Vanderpool, D., Wolinski, H., Aime, M.C., Schneider, K., Stabentheiner, E., Toome-Heller, M., Thor, G., Mayrhofer, H., Johannesson, H., McCutcheon, J.P. (2016). Basidiomycete yeasts in the cortex of ascomycete macrolichens. Science 353, 488-492.
The following script retrieves the longest transcript (splicing isoform) from a set of Trinity transcripts assigned to a single component.
# python script
#The input file should look like the following (or the script can be modified to take some variation thereof):
# [stripped out fasta headers:]
#>c1639_g1_i1 len=370 path=[1:0-369]
#>c1639_g2_i1 len=574 path=[1246:0-573]
#>c1640_g1_i1 len=395 path=[171:0-26 170:27-394]
#>c1647_g1_i1 len=304 path=[33:0-47 32:48-303]
infilename = "Trinity_identifier.fasta"
infile = open(infilename, 'r')
infile_ncols = 1
infile_nrows = 412618
outfilename = "comp_lens.targets.longest_only.txt"
OUT = open(outfilename, 'w')
import re
def recognize_comp(input_string):
    SearchStr = '(c\d+_g\d+)_i\d+'
    Result = re.search(SearchStr, input_string)
    return (Result.group(1))
def recognize_seq(input_string):
    SearchStr = '(c\d+_g\d+_i\d+)'
    Result3 = re.search(SearchStr, input_string)
    return (Result3.group(1))
def recognize_len(input_string2):
    SearchStr2 = 'len=(\d+)'
    Result2 = re.search(SearchStr2, input_string2)
    return (Result2.group(1))
linenumber = 0
comp_len_crosswalk = [[0 for x in xrange(infile_nrows)] for x in xrange(4)]
for line in infile:
    if linenumber >= 0:
        line = line.strip('\n')
        elementlist = line.split('\t')
        i = 0
        for element in elementlist:
            comp_len_crosswalk[0][linenumber] = element
            comp_len_crosswalk[1][linenumber] = recognize_comp(str(element))
            comp_len_crosswalk[2][linenumber] = recognize_seq(str(element))
            comp_len_crosswalk[3][linenumber] = recognize_len(str(element))
    linenumber = linenumber + 1
cl = comp_len_crosswalk[1]
contigs = list(set(cl))
len = 0
for contig in contigs:
    if contig != 0:
        print "For contig: ", contig
        len = 0
        for index in range(0, infile_nrows):
            current_contig = comp_len_crosswalk[1][index]
            if current_contig == contig:
                if int(comp_len_crosswalk[3][index]) >= len:
                    len = int(comp_len_crosswalk[3][index])
                    transcript = comp_len_crosswalk[2][index]
                    my_index = index
        if len != 0:
            print "...and the longest transcript is: ", transcript, "with", len, "bases!"
            OutputString = "%s\t%s\t%s" % (contig, transcript, len)
            OUT.write(OutputString+'\n')
infile.close()
OUT.close()
In the following script, we retrieve all the ORF labels that match the longest transcripts, based on a list of ORF identifiers (i.e., the output of Transdecoder) and a list of the longest transcripts per Trinity component (i.e., the output of the previous script).
# python script
#The input file should look like the following:
# [stripped out fasta headers:]
#>Gene.66844::c100013_g1_i1::g.66844::m.66844 Gene.66844::c100013_g1_i1::g.66844 ORF type:internal len:168 (+) #c100013_g1_i1:1-501(+)
#>Gene.66846::c100013_g1_i2::g.66846::m.66846 Gene.66846::c100013_g1_i2::g.66846 ORF type:internal len:134 (-) #c100013_g1_i2:1-399(-)
infilename = "Transdecoder_identifier.pep"
infile = open(infilename, 'r')
infile_ncols = 1
infile_nrows = 368720  # number of ORFs; modify for other datasets
infile2name = "comp_lens.targets.longest_only.txt"
infile2 = open(infile2name, 'r')
infile2_ncols = 3
infile2_nrows = 339573  # number of Trinity 'genes'; modify for other datasets
outfilename = "target_orfs.from_longest_only.txt"
OUT = open(outfilename, 'w')
import re
def recognize_sequence(input_string):
    SearchStr = '\s(c\d+_g\d+_i\d+)'
    Result = re.search(SearchStr, input_string)
    return (Result.group(1))
def recognize_orf(input_string2):
    SearchStr2 = '>(Gene.\d+)::'
    Result2 = re.search(SearchStr2, input_string2)
    return (Result2.group(1))
linenumber = 0
all_orfs = [[0 for x in xrange(infile_nrows)] for x in xrange(3)]
for line in infile:
    if linenumber >= 0:
        line = line.strip('\n')
        elementlist = line.split('\t')
        i = 0
        for element in elementlist:
            all_orfs[0][linenumber] = element
            all_orfs[1][linenumber] = recognize_orf(str(element))
            all_orfs[2][linenumber] = recognize_sequence(str(element))
    linenumber = linenumber + 1
linenumber = 0
longest_transcripts = [[0 for x in xrange(infile2_nrows)] for x in xrange(infile2_ncols)]
for line in infile2:
    if linenumber >= 0:
        line = line.strip('\n')
        elementlist = line.split('\t')
        i = 0
        for element in elementlist:
            longest_transcripts[i][linenumber] = element
            i += 1
    linenumber = linenumber + 1
ls = longest_transcripts[1]
long_seqs = list(set(ls))
cs = all_orfs[2]
core_seqs = list(set(cs))
elem1 = [x for x in long_seqs]
elem2 = [x for x in core_seqs]
target_seq_list = []
for item in elem1:
    if item in elem2:
        print 'found', item
        target_seq_list.append(item)
print len(target_seq_list)
target_orf_list = []
for seq in target_seq_list:
    for index in range(0, infile_nrows):
        current_seq = all_orfs[2][index]
        if seq == current_seq:
            orf = all_orfs[1][index]
            cus = all_orfs[2][index]
            target_orf_list.append(orf)
            print len(target_orf_list)
            print seq, orf
            OutputString = "%s\t%s" % (seq, orf)
            OUT.write(OutputString+'\n')
infile.close()
OUT.close()
This script takes the output of a blastp query of Transdecoder predicted proteins and matches up DIAMOND results with the ORF list produced in the previous script, leaving only the ORF with the best e-value or the first ORF per transcript to have an e-value of zero.
# python script
#The input formats should be like:
# [Blast fmt6 output file with taxid annotation (last column)]
#
#Gene.66837::c100001_g1_i1::g.66837::m.66837 XP_013271523.1 51.5 101 49 0 1 101 929 1029 7.4e-21 109.0 1442369
#Gene.66838::c100004_g1_i1::g.66838::m.66838 XP_002838600.1 65.8 184 53 2 69 249 30 206 7.1e-60 240.0 #39416;656061
infilename = "Diamond_int.txt"
infile = open(infilename, 'r')
infile_ncols = 13
infile_nrows = 228074 #there are 228074 unique ORFs in this table
infile2name = "target_orfs.from_longest_only.txt"
infile2 = open(infile2name, 'r')
infile2_ncols = 2
infile2_nrows = 279344 # contains 279344 unique ORFs
outfilename = "best_ORF_by_evalue.txt"
OUT = open(outfilename, 'w')
outfilename2 = "intermediate_table.txt"
OUT1 = open(outfilename2, 'w')
import re
def recognize_sequence(input_string):
    SearchStr = '(c\d+_g\d+)_i\d+'
    Result = re.search(SearchStr, input_string)
    return (Result.group(1))
linenumber = 0
blast_table = [[0 for x in xrange(infile_nrows)] for x in xrange(15)]
for line in infile:
    if linenumber >= 0:
        line = line.strip('\n')
        elementlist = line.split('\t')
        i = 0
        for element in elementlist:
            blast_table[i][linenumber] = element
            i += 1
    linenumber = linenumber + 1
linenumber = 0
crosswalk = [[0 for x in xrange(infile2_nrows)] for x in xrange(3)]
for line in infile2:
    if linenumber >= 0:
        line = line.strip('\n')
        elementlist = line.split('\t')
        i = 0
        for element in elementlist:
            crosswalk[i][linenumber] = element
            crosswalk[2][linenumber] = recognize_sequence(str(crosswalk[0][linenumber]))
            i += 1
    linenumber = linenumber + 1
cl = crosswalk[0]
contigs = list(set(cl))  # actually transcripts
n = 0
# In this loop, transcript and contig numbers are pegged onto the blast table
# based on the target_orfs lookup list ("crosswalk").
for contig in contigs:
    if contig != 0:
        for index in range(0, infile2_nrows):
            current_contig = crosswalk[0][index]
            if contig == current_contig:
                #print contig
                current_comp = crosswalk[2][index]
                current_orf = crosswalk[1][index]
                cw_contig = crosswalk[0][index]
                for j in range(0, infile_nrows):
                    bt_orf = blast_table[0][j]
                    if current_orf == bt_orf:
                        blast_table[13][j] = contig  # the transcript (e.g., comp1000_c0_seq1)
                        blast_table[14][j] = current_comp  # the contig (e.g., comp1000_c0)
                        n += 1
                        print "first round", blast_table[0][j], blast_table[13][j], current_orf, current_contig, contig, current_comp, n
                        FirstOutputString = "%s\t%s\t%s\t%s\t%s\t%s\t%s" % (blast_table[0][j], blast_table[13][j], current_orf, current_contig, contig, current_comp, n)
                        OUT1.write(FirstOutputString+'\n')
bo = blast_table[0]
blasted_orfs = list(set(bo))
ao = crosswalk[1]
all_orfs = list(set(ao))
elem1 = [x for x in blasted_orfs]
elem2 = [x for x in all_orfs]
# validating completeness
missed_orf_list = []
for item in elem1:
    if item not in elem2:
        print 'missing from orf_targets_table:', item
        missed_orf_list.append(item)
print len(missed_orf_list)
# and the other way around
for item in elem2:
    if item not in elem1:
        print 'missing from nr_query_table, perhaps no BLAST result:', item
        missed_orf_list.append(item)
print len(missed_orf_list)
index = 0
# This loop returns the ORF with the lowest e-value for each transcript ("contig").
for contig in contigs:
    evalue = float(1000)
    if contig != 0:
        for index in range(0, infile_nrows):
            now_contig = blast_table[13][index]
            if contig == now_contig:
                current_evalue = float(blast_table[10][index])
                if current_evalue <= evalue:
                    evalue = float(blast_table[10][index])
                    col1 = blast_table[0][index]
                    col2 = blast_table[1][index]
                    col3 = blast_table[2][index]
                    col4 = blast_table[3][index]
                    col5 = blast_table[4][index]
                    col6 = blast_table[5][index]
                    col7 = blast_table[6][index]
                    col8 = blast_table[7][index]
                    col9 = blast_table[8][index]
                    col10 = blast_table[9][index]
                    col12 = blast_table[11][index]
                    col13 = blast_table[12][index]
                    col14 = blast_table[13][index]
                    col15 = blast_table[14][index]
                    print contig, col1, current_evalue
        if evalue != 0 and evalue != 1000:
            OutputString = "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (col14, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, evalue, col12, col15, col13)
            print "second round1", OutputString
            OUT.write(OutputString+'\n')
        elif evalue == 0:
            SecondOutputString = "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (col14, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, evalue, col12, col15, col13)
            OUT.write(SecondOutputString+'\n')
            print "second round2", SecondOutputString
infile.close()
infile2.close()
OUT.close()
OUT1.close()
Below is a short shell script used to recover all the Trinity sequences that match the list of identifiers generated by the previous set of Python scripts. The output is a fasta file restricted to the retained transcripts, i.e., for each contig, the longest splicing isoform carrying the ORF with the best (lowest) e-value, or the first ORF with a blast e-value of zero. This file is used in the next section to obtain per-library read counts for the differential expression analysis.
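A minimal sketch of such a recovery step, assuming the single-line Trinity_1.fasta and the best_ORF_by_evalue.txt table produced above (the intermediate file target_ids.txt and the output name filtered_transcripts.fasta are hypothetical):

```shell
# Collect the retained transcript identifiers (first column of the
# best-ORF table produced by the previous Python script).
cut -f1 best_ORF_by_evalue.txt | sort -u > target_ids.txt
# Keep only the matching records from the single-line fasta: remember the
# wanted headers, then toggle printing at each header line.
awk 'NR==FNR { ids[">"$1]; next }
     /^>/    { keep = ($1 in ids) }
     keep' target_ids.txt Trinity_1.fasta > filtered_transcripts.fasta
```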
Transcriptome-specific abundance estimates are obtained through four sequential steps:
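These steps presumably comprise building a Bowtie 2 index of the filtered transcripts, aligning each trimmed library with Bowtie 2, converting the alignments with SAMtools, and estimating abundances with eXpress. The sketch below illustrates such a workflow; the flags, thread counts and file names are assumptions rather than the original settings:

```shell
# Illustrative workflow only; flags and paths are assumed.
bowtie2-build filtered_transcripts.fasta filtered_transcripts
for i in AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8 AS9 AS10 AS11 AS12; do
    # Report all alignments per read (-a), as eXpress expects.
    bowtie2 -a -p 16 -x filtered_transcripts \
        -1 trimmomatic_minus12/"$i"_1.fq -2 trimmomatic_minus12/"$i"_2.fq |
        samtools view -Sb - > "$i".bam
    # eXpress writes results.xprs into the per-library output directory.
    express -o express_"$i" filtered_transcripts.fasta "$i".bam
    cp express_"$i"/results.xprs "$i"_results.xprs
done
```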
Here, the eXpress output is formatted for use in the subsequent edgeR analysis of fold changes. The R script below produces total read count files, which will be used to analyze differential expression between photomorph types.
# R script
setwd("~/Documents/Dendriscosticta/transcriptomics/NoteBook_AS/")
# For each library: import the eXpress file, keep the target_id and tot_counts
# columns, sort by target_id, and write a per-library count table.
for (i in paste0("AS", 1:12)) {
  express_file <- read.table(paste0(i, "_results.xprs"), sep = "\t", header = TRUE)
  counts <- data.frame(express_file$target_id, express_file$tot_counts)
  sorted <- counts[order(express_file$target_id), ]
  write.table(sorted, paste0("results_", i, ".txt"), sep = "\t",
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}
The following shell script produces a non-redundant matrix combining the eXpress total read counts with the fields of the DIAMOND blast tabular file. Along the way, the identifiers of the read count files are simplified.
# bash
for i in "AS1" "AS2" "AS3" "AS4" "AS5" "AS6" "AS7" "AS8" "AS9" "AS10" "AS11" "AS12";
do
sort -t $'\t' -k 1 results_"$i".txt > results_"$i"_sorted.txt
done
sort -t $'\t' -k 1 best_ORF_by_evalue.txt > best_ORF_by_evalue_sorted.txt
# Remove the file "all_results.tsv" from a previous run (if any).
rm -f all_results.tsv
# Make sure the file suffix of the new file is not .txt
OUT=all_results.tsv
touch $OUT
for i in "AS1" "AS2" "AS3" "AS4" "AS5" "AS6" "AS7" "AS8" "AS9" "AS10" "AS11" "AS12";
do
paste $OUT <(awk -F\\t '{print $2}' results_"$i"_sorted.txt) > $OUT.tmp
mv $OUT.tmp $OUT
done
paste best_ORF_by_evalue_sorted.txt all_results.tsv > best_ORF_by_evalue_eXpress.txt
The following Python script uses ETE’s ncbi_taxonomy module, which, among other things, converts taxid numbers into lineage track information. The output is a modified version of the above-mentioned matrix, with an explicit taxonomic lineage for each DIAMOND blast hit. If a hit has several taxid numbers, only the first one is considered. Subsets of the matrix can now be extracted based on taxonomic identity and used as input for normalization and differential expression analysis.
# python script
from ete3 import NCBITaxa
ncbi = NCBITaxa()
f = open('best_ORF_by_evalue_eXpress.txt')
OutFileName = 'best_ORF_by_evalue_eXpress_taxname.txt'
OutFile = open(OutFileName, 'w')
Line = f.readline()
while Line:
    DiamondList = Line.split('\t')
    TaxID = DiamondList[14]
    try:
        # When a hit carries several taxids, keep only the first one.
        if ";" in TaxID:
            TaxID = TaxID.partition(";")[0]
        lineage = ncbi.get_lineage(TaxID)
        names = ncbi.get_taxid_translator(lineage)
        lignee = [names[taxid] for taxid in lineage]
        OutputString = "%s\t%s\t" % (Line.replace("\n", ""), lignee)
        print OutputString
        OutFile.write(OutputString+"\n")
    except:
        pass
    Line = f.readline()
f.close()
OutFile.close()
Taxon subsets of the matrix were created using the following custom bash script and the lineage track information generated by the previous Python script. Additionally, lineages are collapsed to a single taxonomic level. The code can be modified and fine-tuned in order to obtain taxonomic subsets with varying degrees of specificity (i.e., different taxonomic ranks). Rare taxa (i.e., those that make up less than 1 percent of the total) are merged into a single category named “Others”.
# bash
for i in "Rhizobiales" "Ascomycota" "Lecanoromycetes" "Methylobacteriaceae" "Bacteria" "root" "Fungi";
do
cd /Users/antoinesimon/Documents/Dendriscosticta/transcriptomics/NoteBook_AS/
mkdir "$i"/
cd "$i"
grep "u'$i'" ../best_ORF_by_evalue_eXpress_taxname.txt > "$i".txt
# Remove lines with 'short' tax strings. Search term must be identical to the one in the subsequent perl command. Add/remove one (or several) \, u\'.*\' to get more/less specific taxonomic units and modify the captured group in the perl command (i.e., move the parentheses)
grep "u\'root'\, u\'cellular organisms\'\, u\'.*\'\, u\'.*\'\, u\'.*\'\, u\'.*\'\, u\'.*\'\, u\'.*\'" "$i".txt > best_ORF_by_evalue_eXpress_taxname.txt
# Store individual expression counts.
cut -f1,17 best_ORF_by_evalue_eXpress_taxname.txt > results_1.txt &&
cut -f1,18 best_ORF_by_evalue_eXpress_taxname.txt > results_2.txt &&
cut -f1,19 best_ORF_by_evalue_eXpress_taxname.txt > results_3.txt &&
cut -f1,20 best_ORF_by_evalue_eXpress_taxname.txt > results_4.txt &&
cut -f1,21 best_ORF_by_evalue_eXpress_taxname.txt > results_5.txt &&
cut -f1,22 best_ORF_by_evalue_eXpress_taxname.txt > results_6.txt &&
cut -f1,23 best_ORF_by_evalue_eXpress_taxname.txt > results_7.txt &&
cut -f1,24 best_ORF_by_evalue_eXpress_taxname.txt > results_8.txt &&
cut -f1,25 best_ORF_by_evalue_eXpress_taxname.txt > results_9.txt &&
cut -f1,26 best_ORF_by_evalue_eXpress_taxname.txt > results_10.txt &&
cut -f1,27 best_ORF_by_evalue_eXpress_taxname.txt > results_11.txt &&
cut -f1,28 best_ORF_by_evalue_eXpress_taxname.txt > results_12.txt &&
# Store the individual expression counts in one file.
paste -d "\t" results_1.txt results_2.txt results_3.txt results_4.txt results_5.txt results_6.txt results_7.txt results_8.txt results_9.txt results_10.txt results_11.txt results_12.txt | cut -f2,4,6,8,10,12,14,16,18,20,22,24 > results_ALL.txt &&
# Add a header.
echo -e "AS1\tAS2\tAS3\tAS4\tAS5\tAS6\tAS7\tAS8\tAS9\tAS10\tAS11\tAS12" | cat - results_ALL.txt > temp && mv temp results_ALL.txt &&
# Keep one taxonomic level.
perl -lne "print \$1 if /u\'root'\, u\'cellular organisms\'\, u\'.*?\'\, u\'.*?\'\, u\'.*?\'\, u\'.*?\'\, u\'.*?\'\, u\'(.*?)\'/" best_ORF_by_evalue_eXpress_taxname.txt > Tax_ALL.txt &&
# Count number of occurrences in each category.
sort Tax_ALL.txt | uniq -c | sed 's/^ *//' > Tax_all_count.txt &&
# Add all the occurrences.
awk -F, '{sum+=$1} END {print sum}' Tax_all_count.txt > sum.txt &&
# Convert number of occurrences into floating numbers (fraction).
awk -v var2=`cat sum.txt` '{ print ( $1 / var2 ) }' Tax_all_count.txt > fraction.txt &&
# Remove the file "fraction2.txt" from a previous run (if any).
rm -f fraction2.txt
# Convert scientific notation to decimal.
for n in `cat fraction.txt`; do XXX=$(printf "%.14f" $n) ; echo $XXX >> fraction2.txt ; done &&
# Store the rare taxa (below 1%) into one file.
paste -d" " fraction2.txt Tax_all_count.txt | awk '$1<0.01' | perl -lne "print \$1 if /\d\.\d+ \d+ (.*)/" > Tax_1percent.txt &&
# Get rid of the special characters and spaces.
sed -i '' -e 's/[^a-zA-Z]/_/g' Tax_1percent.txt &&
sed -i '' -e 's/[^a-zA-Z]/_/g' Tax_ALL.txt &&
# Replace rare taxa by the category 'Others'.
for n in `cat Tax_1percent.txt`; do sed -i '' -e "s/^$n$/Others/g" Tax_ALL.txt ; done
# Isolate the sequence identifier
cut -f1 best_ORF_by_evalue_eXpress_taxname.txt > identifier.txt
done
Differential expression between the photomorph types, which can also reflect dissimilar cell abundances, was assessed using the Bioconductor package edgeR and Fisher’s exact test. In each putative taxonomic subset, only transcripts with counts per million (CPM) of 20 or greater in at least two samples were included in the analysis. Additionally, samples were clustered in a multidimensional scaling (MDS) plot using the plotMDS function implemented in the Bioconductor package limma in order to assess the adequacy of the differential expression analyses.
# R script
directory <- "/Users/antoinesimon/Documents/Dendriscosticta/transcriptomics/Notebook_AS/Fungi/"
outputPrefix <- "edgeR"
setwd(directory)
library(plyr)
library(edgeR)
## Loading required package: limma
library(ggplot2)
library(randomcoloR)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(bsselectR)
library(stringr)
sampleCondition <- c("Cyanomorph",
"Cyanomorph",
"Chloromorph",
"Chloromorph",
"Chloromorph",
"Chloromorph",
"Chloromorph",
"Cyanomorph",
"Cyanomorph",
"Chloromorph",
"Cyanomorph",
"Cyanomorph")
eXpress <- data.frame(read.table("results_ALL.txt",sep="\t", header = TRUE))
Tax <- data.frame(read.table("Tax_ALL.txt",sep="\t", header = FALSE))
Identifier <- data.frame(read.table("identifier.txt",sep="\t", header = FALSE))
#####edgeR
d <- DGEList(counts=eXpress,group=factor(sampleCondition))
d
## An object of class "DGEList"
## $counts
## AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8 AS9 AS10 AS11 AS12
## 1 3 3 2 3 0 0 9 0 0 0 0 0
## 2 0 0 0 2 0 1 0 2 0 0 0 0
## 3 2 0 0 0 0 2 0 0 0 0 0 0
## 4 0 0 0 7 0 0 0 2 2 0 0 0
## 5 0 0 0 0 0 0 0 1 0 0 0 0
## 79411 more rows ...
##
## $samples
## group lib.size norm.factors
## AS1 Cyanomorph 16472704 1
## AS2 Cyanomorph 5190286 1
## AS3 Chloromorph 13221484 1
## AS4 Chloromorph 13910906 1
## AS5 Chloromorph 12228978 1
## 7 more rows ...
dim(d)
## [1] 79416 12
d.full <- d
head(d$counts)
## AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8 AS9 AS10 AS11 AS12
## 1 3 3 2 3 0 0 9 0 0 0 0 0
## 2 0 0 0 2 0 1 0 2 0 0 0 0
## 3 2 0 0 0 0 2 0 0 0 0 0 0
## 4 0 0 0 7 0 0 0 2 2 0 0 0
## 5 0 0 0 0 0 0 0 1 0 0 0 0
## 6 2 2 1 4 0 0 10 0 0 0 0 0
head(cpm(d))
## AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8
## 1 0.1821195 0.5780028 0.15126895 0.2156581 0 0.00000000 0.6755996 0.00000000
## 2 0.0000000 0.0000000 0.00000000 0.1437721 0 0.06875718 0.0000000 0.19113546
## 3 0.1214130 0.0000000 0.00000000 0.0000000 0 0.13751436 0.0000000 0.00000000
## 4 0.0000000 0.0000000 0.00000000 0.5032023 0 0.00000000 0.0000000 0.19113546
## 5 0.0000000 0.0000000 0.00000000 0.0000000 0 0.00000000 0.0000000 0.09556773
## 6 0.1214130 0.3853352 0.07563447 0.2875442 0 0.00000000 0.7506663 0.00000000
## AS9 AS10 AS11 AS12
## 1 0.0000000 0 0 0
## 2 0.0000000 0 0 0
## 3 0.0000000 0 0 0
## 4 0.1002863 0 0 0
## 5 0.0000000 0 0 0
## 6 0.0000000 0 0 0
apply(d$counts, 2, sum)
## AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8
## 16472704 5190286 13221484 13910906 12228978 14543936 13321499 10463783
## AS9 AS10 AS11 AS12
## 19942910 15499038 2956313 20568914
keep <- rowSums(cpm(d)>20) >= 2
d <- d[keep,]
dim(d)
## [1] 5547 12
Tax <- Tax[keep,]
Identifier <- Identifier[keep,]
d$samples$lib.size <- colSums(d$counts)
d$samples
## group lib.size norm.factors
## AS1 Cyanomorph 16252946 1
## AS2 Cyanomorph 5063148 1
## AS3 Chloromorph 13066144 1
## AS4 Chloromorph 13687450 1
## AS5 Chloromorph 12151907 1
## AS6 Chloromorph 14266025 1
## AS7 Chloromorph 13040150 1
## AS8 Cyanomorph 10057223 1
## AS9 Cyanomorph 19236602 1
## AS10 Chloromorph 15391433 1
## AS11 Cyanomorph 2880101 1
## AS12 Cyanomorph 20325175 1
#Normalizing the data
d <- calcNormFactors(d)
d
## An object of class "DGEList"
## $counts
## AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8 AS9 AS10 AS11 AS12
## 1314 333 350 439 185 0 244 732 2 0 0 1 5
## 1386 89 34 1319 1042 53 105 939 1033 187 128 29 212
## 1759 112 142 70 37 0 53 309 0 0 0 0 0
## 2402 1074 313 457 670 799 643 534 453 909 807 90 1095
## 2403 8235 3160 7749 8065 6403 10022 7742 5715 13155 9339 1435 15567
## 5542 more rows ...
##
## $samples
## group lib.size norm.factors
## AS1 Cyanomorph 16252946 1.0678444
## AS2 Cyanomorph 5063148 1.0610315
## AS3 Chloromorph 13066144 0.7174709
## AS4 Chloromorph 13687450 1.0453064
## AS5 Chloromorph 12151907 1.1204090
## 7 more rows ...
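The TMM factors computed by `calcNormFactors()` do not change the counts themselves; edgeR multiplies each library size by its factor to obtain an effective library size, which is then used by `cpm()` and `exactTest()`. A quick check with the values printed above (first three libraries only):

```r
# Effective library sizes implied by the TMM normalization factors above
lib.size  <- c(AS1 = 16252946, AS2 = 5063148,  AS3 = 13066144)
norm.fact <- c(AS1 = 1.0678444, AS2 = 1.0610315, AS3 = 0.7174709)
round(lib.size * norm.fact)   # e.g. AS3 shrinks to ~9.4M "effective" reads
```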
#Data Exploration
colors <- rep(c("green4", "blue4"), 2)
plotMDS(d, method="bcv", col=colors[d$samples$group], labels = NULL, pch =1)
#Estimating the Dispersion
d1 <- estimateCommonDisp(d, verbose=T)
## Disp = 0.33078 , BCV = 0.5751
names(d1)
## [1] "counts" "samples" "common.dispersion"
## [4] "pseudo.counts" "pseudo.lib.size" "AveLogCPM"
d1 <- estimateTagwiseDisp(d1)
names(d1)
## [1] "counts" "samples" "common.dispersion"
## [4] "pseudo.counts" "pseudo.lib.size" "AveLogCPM"
## [7] "prior.df" "prior.n" "tagwise.dispersion"
## [10] "span"
plotBCV(d1)
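The biological coefficient of variation (BCV) reported by `estimateCommonDisp()` is simply the square root of the common dispersion, which can be verified from the values printed above:

```r
# BCV = sqrt(common dispersion); value as printed by estimateCommonDisp()
disp <- 0.33078
sqrt(disp)   # ~0.5751, matching the reported BCV
```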
#Differential Expression
et12 <- exactTest(d1, pair=c(1,2)) # compare groups 1 and 2
topTags(et12, n=10)
## Comparison of groups: Cyanomorph-Chloromorph
## logFC logCPM PValue FDR
## 28662 -9.092309 6.585019 1.744433e-81 9.676370e-78
## 27948 -9.049912 7.399528 1.686934e-72 4.678712e-69
## 33726 -10.069914 5.342788 3.257219e-40 4.609621e-37
## 43330 -9.417853 5.540635 3.324046e-40 4.609621e-37
## 24288 -6.813098 6.870328 5.064985e-36 5.619095e-33
## 50392 -8.579662 5.613978 2.159788e-35 1.996724e-32
## 5697 -8.026864 5.398691 4.380005e-35 3.470841e-32
## 22456 -5.188234 6.477067 5.049360e-31 3.501100e-28
## 28917 -8.711898 5.454819 4.292425e-30 2.645565e-27
## 18745 3.917930 6.657045 2.080585e-29 1.154100e-26
#FDR and save
DEG_out <- topTags(et12, n = Inf)$table
Identifier_frame <- as.data.frame(Identifier)
DEG_out <- DEG_out[order(as.numeric(rownames(DEG_out))),,drop=FALSE]
Pairwise_results <- cbind(Identifier_frame, DEG_out)
colnames(Pairwise_results)[1] <- "Name"
write.table(Pairwise_results, "Pairwise_results.txt", row.names=FALSE, quote=FALSE, sep='\t')
UP_DOWN <- Pairwise_results[(Pairwise_results[,5]<0.05),]
UP <- UP_DOWN[(UP_DOWN[,2]>0),]
DOWN <- UP_DOWN[(UP_DOWN[,2]<0),]
UP <- UP[1]
DOWN <- DOWN[1]
write.table(UP, "UP.txt", row.names=FALSE, col.names = FALSE, quote=FALSE, sep='\t')
write.table(DOWN, "DOWN.txt", row.names=FALSE, col.names = FALSE, quote=FALSE, sep='\t')
de1 <- decideTestsDGE(et12, adjust.method="BH", p.value=0.05)
summary(de1)
## Cyanomorph-Chloromorph
## Down 171
## NotSig 5207
## Up 169
# Taxonomy
Tax_unlisted <- unlist(Tax, use.names = FALSE)
plotSmear(et12, de.tags=Tax_unlisted, cex=0.7, col="grey")
abline(h = c(-2, 2), col = "blue")
# P-values < 0.05
de1tags12 <- rownames(d1)[as.logical(de1)]
plotSmear(et12, de.tags=de1tags12, cex=0.48, col="grey")
abline(h = c(-2, 2), col = "blue")
library(randomcoloR)  # provides distinctColorPalette()
library(ggplot2)
library(plotly)       # provides ggplotly()
n <- 21
palette_random <- distinctColorPalette(n)
p <- ggplot(et12$table, aes(logCPM, logFC, col=Tax, text=Identifier)) + geom_point(alpha=0.55) + scale_colour_manual(values=palette_random)
q <- p + theme_bw() +
#eliminates background, gridlines, and chart border
theme(
plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,panel.border = element_blank()
) +
#draws x and y axis line
theme(axis.line = element_line(color = "black")) +
#add horizontal lines
geom_hline(yintercept=2, color = "grey") +
geom_hline(yintercept=-2, color = "grey")
ggplotly(q)
# Uncomment for non-interactive graphs:
# plot(q)
The Blast2GO methodology, as implemented in OmicsBox, was used to assign Gene Ontology (GO) terms to the transcripts. First, however, the DIAMOND BLASTX results had to be generated in XML format so that they could be loaded directly into the GUI.
# bash
module load /isg/shared/modulefiles/diamond/0.9.09
diamond blastx \
-d ~/tempdata3/antoine/nt_diamond/nr \
-q express_new/best_orfs_Trinity.fasta \
-o AS/Diamond_b2g_CDS.xml \
--taxonmap ~/home/antoine_simon/database/prot.accession2taxid.gz \
-e 1e-6 -p 8 -f 5 -k 1
head -n20 Diamond_b2g_CDS.xml >> AS/Diamond_b2g_CDS_fungi.xml
for n in `cat AS/_fungi_identifier.txt `; do LC_ALL=C fgrep "$n" AS/Diamond_b2g_CDS.xml -B3 -A43 ; done >> AS/Diamond_b2g_CDS_fungi.xml
tail -n2 Diamond_b2g_CDS.xml >> AS/Diamond_b2g_CDS_fungi.xml
grep -w 'Iteration_query\-def\|Hsp_qseq' Diamond_b2g_CDS_fungi.xml | sed 's/<Iteration_query\-def>/>/g' | sed 's/<\/Iteration_query\-def>//g' | sed 's/<Hsp_qseq>//g' | sed 's/<\/Hsp_qseq>//g' >> b2g_fungi.fasta
The following script was used to combine the results of the GO enrichment and DEG analyses by building a heatmap that displays the most specific enriched Gene Ontology (GO) terms identified for statistically significant fungal transcripts. Enriched terms and transcripts are organized by hierarchical clustering. DEG significance scores correspond to the log fold change (logFC) between photomorphs (down: chloromorphs; up: cyanomorphs). GO term significance scores correspond to the negative logarithm of the P-values evaluating the significance of the GO terms.
# R script
#This chunk mostly follows the script written by Kevin Blighe on https://www.biostars.org/p/299161/
directory <- "/Users/antoinesimon/Documents/Dendriscosticta/transcriptomics/Notebook_AS/Fungi/"
setwd(directory)
library("ComplexHeatmap")
require("circlize")
library("viridis")
Pairwise_results <- cbind(Identifier_frame, DEG_out)
colnames(Pairwise_results)[1] <- "Name"
log2FCcutoff <- 0.05
BHcutoff <- 0.05
sigGeneList <- subset(Pairwise_results, abs(logFC)>=log2FCcutoff & PValue<=BHcutoff)[,1]
DAVIDfile <- "DEG_specific20for2_removed_too_general.txt"
DAVID <- read.table(DAVIDfile, sep="\t", header=TRUE)
colnames(DAVID)
names(DAVID)[3] <- "GO_Name"
names(DAVID)[6] <- "P_Value"
names(DAVID)[11] <- "TestSet_Sequences"
colnames(DAVID)
DAVID$TestSet_Sequences <- gsub(';', ', ', DAVID$TestSet_Sequences)
DAVID$TestSet_Sequences <- gsub('$', ', x', DAVID$TestSet_Sequences)
write.table(DAVID, "omicsbox_table2.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
DAVIDfile <- "omicsbox_table2.txt"
#Enrichment cut-off
enrichBcutoff <- 1e-02 # should be way smaller
#DAVID <- subset(DAVID, P_Value<enrichBcutoff)
DAVID <- DAVID[,c(4,3,7,11)]
#Create a new dataframe that has '1' for when the gene is part of a term, and '0' when not
annGSEA <- data.frame(row.names=sigGeneList)
for (j in 1:length(sigGeneList)) {
  gene <- sigGeneList[j]
  pattern <- paste("^", gene, ", |, ", gene, "$| ", gene, ",", sep="")
  for (k in 1:nrow(DAVID)) {
    # 1 if the gene occurs as a whole token in the term's sequence list, else 0
    annGSEA[j,k] <- as.integer(any(grepl(pattern, DAVID$TestSet_Sequences[k])))
  }
}
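The role of the trailing `, x` appended by the earlier `gsub` step (a sentinel ensuring that every gene identifier, including the last, is followed by a comma) can be checked on a toy string. Gene IDs here are hypothetical:

```r
# Toy check of the whole-token membership regex used in the loop above
seqs <- "g1, g7, g12, x"   # one GO term's gene list after the gsub steps
mk_pattern <- function(gene) paste("^", gene, ", |, ", gene, "$| ", gene, ",", sep="")
grepl(mk_pattern("g1"),  seqs)   # TRUE  -- matched at the start of the list
grepl(mk_pattern("g12"), seqs)   # TRUE  -- matched as a full token
grepl(mk_pattern("g2"),  seqs)   # FALSE -- "g2" only occurs inside "g12"
```

Without the sentinel, a gene sitting at the very end of the list would lack a following comma and could be missed by the `" gene,"` alternative.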
colnames(annGSEA) <- DAVID[,2]
#Remove terms with no overlapping genes
annGSEA <- annGSEA[,apply(annGSEA, 2, mean)!=0]
#Remove genes with no overlapping terms
annGSEA <- annGSEA[apply(annGSEA, 1, mean)!=0,]
#Match the order of rownames in DESeq_output with that of annGSEA
Pairwise_results <- Pairwise_results[which(Pairwise_results$Name %in% rownames(annGSEA)),]
Pairwise_results <- Pairwise_results[match(rownames(annGSEA), Pairwise_results$Name),]
#Set text and figure dimensions
geneLab=6
termLab=8
#Create heatmap annotations
#Color bar for fold changes
dfMinusLog10FDRGenes <- data.frame(abs(Pairwise_results[which(Pairwise_results[,1] %in% rownames(annGSEA)),"logFC"]))
dfMinusLog10FDRGenes[dfMinusLog10FDRGenes=="Inf"] <- 0
dfFoldChangeGenes <- data.frame(Pairwise_results[which(Pairwise_results[,1] %in% rownames(annGSEA)),"logFC"])
dfGeneAnno <- data.frame(dfMinusLog10FDRGenes, dfFoldChangeGenes)
dfGeneAnno[,2] <- ifelse(dfGeneAnno[,2]>0, "Cyanomorph", "Chloromorph")
colnames(dfGeneAnno) <- c("DEG\nsignificance\nscore", "Regulation")
colours <- list("Regulation"=c("Cyanomorph"="blue4", "Chloromorph"="green4"))
haGenes <- rowAnnotation(df=dfGeneAnno, col=colours, width=unit(1,"cm"))
#Color bar for GO Term significance score
#Read the enrichment table once and keep -log10(P) for the displayed terms
DAVID2 <- read.table(DAVIDfile, sep="\t", header=TRUE)
dfMinusLog10BenjaminiTerms <- data.frame(-log10(DAVID2[which(DAVID2$GO_Name %in% colnames(annGSEA)), "P_Value"]))
colnames(dfMinusLog10BenjaminiTerms) <- "GO Term\nsignificance\nscore"
col_viridis = colorRamp2(c(1.2, 2.63), rev(viridis(2))) # breaks must be increasing
haTerms <- HeatmapAnnotation(
  df = dfMinusLog10BenjaminiTerms,
  col = list("GO Term\nsignificance\nscore" = col_viridis),
  colname = anno_text(colnames(annGSEA), rot=40, just="right",
                      location=unit(1,"npc")-unit(2,"mm"), gp=gpar(fontsize=termLab)),
  annotation_height = unit.c(unit(1, "cm"), unit(8, "cm")))
pdf("GO.pdf", width=12, height=10)
hmapGSEA <- Heatmap(annGSEA,
name="My enrichment",
split=dfGeneAnno[,2],
col=c("0"="white", "1"="grey45"),
rect_gp=gpar(col="grey85"),
cluster_rows=T,
show_row_dend=T,
row_title="Statistically-significant genes",
row_title_side="left",
row_title_gp=gpar(fontsize=12, fontface="bold"),
row_title_rot=0,
show_row_names=TRUE,
row_names_gp=gpar(fontsize=geneLab, fontface="bold"),
row_names_side="left",
row_names_max_width=unit(15, "cm"),
row_dend_width=unit(10,"mm"),
cluster_columns=T,
show_column_dend=T,
column_title="Enriched terms",
column_title_side="top",
column_title_gp=gpar(fontsize=12, fontface="bold"),
column_title_rot=0,
show_column_names=FALSE,
show_heatmap_legend=FALSE,
clustering_distance_columns="euclidean",
clustering_method_columns="ward.D2",
clustering_distance_rows="euclidean",
clustering_method_rows="ward.D2",
bottom_annotation=haTerms)
draw(hmapGSEA + haGenes, heatmap_legend_side="right", annotation_legend_side="right")
dev.off()